Decision Trees Example
Let’s walk through the steps to build a simple decision tree using the Gini Index for classification. We'll work with an example dataset to classify whether a person will buy a product based on Age and Income.
Example Dataset
ID | Age | Income | Will Buy (Target) |
---|---|---|---|
1 | 25 | 40000 | No |
2 | 45 | 80000 | Yes |
3 | 35 | 60000 | Yes |
4 | 50 | 120000 | Yes |
5 | 23 | 35000 | No |
Step 1: Select the Best Split
Gini Index Formula
The Gini Index measures the impurity of a node. Lower values indicate better splits.
Where:
- ( p_i ) is the probability of samples belonging to class ( i ) at a node.
- ( k ) is the total number of classes.
Calculate Gini Index for a Split
Assume we split the dataset based on the threshold ( age ≤ 30 ).
-
Left Node (Age ≤ 30):
- Data: ID 1, ID 5
- "Will Buy" = No (2 samples, 100%)
- ( Gini = 1 - (1^2) = 0 )
-
Right Node (Age > 30):
- Data: ID 2, ID 3, ID 4
- "Will Buy" = Yes (3 samples, 100%)
- ( Gini = 1 - (1^2) = 0 )
Weighted Gini for the split:
This split perfectly separates the data, making it the best choice.
Step 2: Split the Dataset
Using ( age ≤ 30 ), divide the dataset into two groups:
- Left Node (Age ≤ 30): ID 1, ID 5
- Right Node (Age > 30): ID 2, ID 3, ID 4
Step 3: Repeat
Recursively apply the splitting process on the nodes until a stopping condition is met (e.g., max depth or minimum samples).
Step 4: Assign Leaf Values
- Left Node: All samples are "No," so the leaf value is "No."
- Right Node: All samples are "Yes," so the leaf value is "Yes."
What is Gini Index?
The Gini Index measures how often a randomly chosen element would be incorrectly classified if randomly labeled based on the distribution of classes in the node. It ranges from:
- 0 (perfect purity): All elements in a node belong to one class.
- 0.5 (maximum impurity): Classes are evenly distributed.
Key Points to Remember
- The Gini Index helps decide the best split by minimizing impurity.
- A split with a Gini Index of 0 indicates perfect separation of classes.
- The process repeats recursively, assigning class labels or values to the leaf nodes.
Code Implementation
Here’s the Python code to demonstrate this using sklearn
:
from sklearn.tree import DecisionTreeClassifier, plot_tree
import pandas as pd
import matplotlib.pyplot as plt
# Example Dataset
data = {
"Age": [25, 45, 35, 50, 23],
"Income": [40000, 80000, 60000, 120000, 35000],
"Will_Buy": ["No", "Yes", "Yes", "Yes", "No"]
}
df = pd.DataFrame(data)
# Features (X) and Target (y)
X = df[["Age", "Income"]]
y = df["Will_Buy"]
# Decision Tree with Gini Index
clf = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=42)
clf.fit(X, y)
# Visualize the Decision Tree
plt.figure(figsize=(10, 6))
plot_tree(clf, filled=True, feature_names=["Age", "Income"], class_names=["No", "Yes"])
plt.show()
To test your Decision Tree model, you need to provide new data (test samples) and use the .predict()
method to see the model's predictions. Here's how you can do it:
1. Create Test Data
Provide new Age and Income values to test the model.
# New test samples
X_test = pd.DataFrame({
"Age": [30, 40, 28],
"Income": [50000, 90000, 45000]
})
# Predict using the trained model
predictions = clf.predict(X_test)
# Display predictions
print(predictions) # Output will be an array of "Yes" or "No"
2. Evaluate Model Performance
If you have actual labels for test data, you can measure accuracy:
from sklearn.metrics import accuracy_score
# Example: Assuming these are actual labels for test data
y_test = ["Yes", "Yes", "No"]
# Get predictions
predictions = clf.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy * 100:.2f}%")
You can split your dataset into training and testing sets using train_test_split
from sklearn.model_selection
. Here’s how:
1. Split the Data
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Splitting the data (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train the model on the training data
clf.fit(X_train, y_train)
# Predict on the test data
y_pred = clf.predict(X_test)
# Evaluate model performance
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")
cross-validation for better evaluation
You can use cross-validation to evaluate the model more reliably by splitting the data into multiple training and testing sets. Here’s how you can do it using cross_val_score
from sklearn.model_selection
:
Cross-Validation Implementation
from sklearn.model_selection import cross_val_score
# Perform 5-fold cross-validation
cv_scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
# Print average accuracy and individual fold scores
print(f"Cross-validation scores: {cv_scores}")
print(f"Average accuracy: {cv_scores.mean() * 100:.2f}%")
You can evaluate Precision, Recall, F1-score, and Accuracy using cross_val_score
and cross_validate
. Here’s a single code snippet that computes all these metrics using 5-fold cross-validation:
Full Model Evaluation Code
from sklearn.model_selection import cross_validate
from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score, f1_score
# Define scoring metrics
scoring = {
'accuracy': make_scorer(accuracy_score),
'precision': make_scorer(precision_score, pos_label="Yes"),
'recall': make_scorer(recall_score, pos_label="Yes"),
'f1': make_scorer(f1_score, pos_label="Yes")
}
# Perform cross-validation
cv_results = cross_validate(clf, X, y, cv=5, scoring=scoring)
# Print averaged scores
print(f"Accuracy: {cv_results['test_accuracy'].mean() * 100:.2f}%")
print(f"Precision: {cv_results['test_precision'].mean() * 100:.2f}%")
print(f"Recall: {cv_results['test_recall'].mean() * 100:.2f}%")
print(f"F1-score: {cv_results['test_f1'].mean() * 100:.2f}%")
To make it more easier you can use confusion matrix To evaluate the model further, we can compute the confusion matrix and classification report using Python.here is the code snippet to do that:
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
# Sample data
data = {'Age': [25, 45, 35, 50, 23], 'Income': [40000, 80000, 60000, 120000, 35000], 'Will_Buy': ['No', 'Yes', 'Yes', 'Yes', 'No']}
X = [[25, 40000], [45, 80000], [35, 60000], [50, 120000], [23, 35000]]
y = ['No', 'Yes', 'Yes', 'Yes', 'No']
# Splitting data into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train the model
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)
# Classification Report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
we can also use classification_report to get the precision, recall, f1-score, and support for each class. Here is how we can achieve that:
from sklearn.metrics import classification_report
# New test samples
X_test = pd.DataFrame({
"Age": [30, 40, 28,40,50 ],
"Income": [50000, 90000, 45000,80000,59000]
})
# Predict using the trained model
predictions = clf.predict(X_test)
report = classification_report(y, predictions)
print(predictions,"this is preds")
print(y,"this is your real class")
print(report)
print(predictions)